Teaching a multi-fingered dexterous robot hand to grasp objects in the real world is a challenging problem due to its high-dimensional state and action spaces. We present a robot learning system that takes a small number of human demonstrations and learns to grasp unseen object poses given partially occluded observations. Our system leverages a small motion-capture dataset and generates a large dataset of diverse, successful grasping trajectories for a multi-fingered robot gripper. By adding domain randomization, we show that our dataset provides robust grasping trajectories that can be transferred to a policy learner. We train a dexterous grasping policy that takes the point cloud of the object as input and predicts continuous actions to grasp the object from different initial robot states. We evaluate the effectiveness of our system on a 22-DoF floating hand in simulation and on a 23-DoF Allegro robot hand mounted on a KUKA arm in the real world. The policy learned from our dataset generalizes well to unseen object poses in both simulation and the real world.
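A minimal sketch of the kind of point-cloud-conditioned grasping policy described above, written in PyTorch. The encoder structure, layer sizes, and the assumption that the robot state has the same dimension as the 22-dim action are illustrative choices, not the authors' architecture.

```python
import torch
import torch.nn as nn

class PointCloudGraspPolicy(nn.Module):
    def __init__(self, action_dim=22):
        super().__init__()
        # Per-point MLP followed by max-pooling: a PointNet-style global feature.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
        )
        # Action head conditioned on the global feature and the current robot state.
        self.head = nn.Sequential(
            nn.Linear(256 + action_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, points, robot_state):
        # points: (B, N, 3) object point cloud; robot_state: (B, action_dim)
        feat = self.point_mlp(points).max(dim=1).values   # (B, 256)
        return self.head(torch.cat([feat, robot_state], dim=-1))

policy = PointCloudGraspPolicy()
action = policy(torch.randn(2, 1024, 3), torch.randn(2, 22))  # (2, 22) continuous actions
```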
Underwater images suffer from color casts, low contrast, and hazy effects due to light absorption, refraction, and scattering, which degrade high-level applications such as object detection and object tracking. Recent learning-based methods demonstrate impressive performance on underwater image enhancement; however, most of these works use synthetic image pairs for supervised learning and ignore the domain gap to real-world data. To address this problem, we propose a domain adaptation framework for underwater image enhancement via content and style separation. Unlike prior domain adaptation works for underwater image enhancement that aim to minimize the latent discrepancy between synthetic and real-world data, we separate the encoded features into content and style latents, associate the style latents with different domains, i.e., the synthetic, real-world underwater, and clean domains, and perform domain adaptation and image enhancement in the latent space. Through latent manipulation, our model provides a user-interactive interface to adjust the enhancement level with continuous changes. Experiments on various public underwater benchmarks demonstrate that the proposed framework performs domain adaptation for underwater image enhancement and outperforms various state-of-the-art underwater image enhancement algorithms both quantitatively and qualitatively. The model and source code are available at https://github.com/fordevoted/uiess
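A hedged sketch of the content/style split described above, not the released UIESS code: an encoder produces a content latent and a style latent, so a real-world image's content can be recombined with a "clean-domain" style code. All layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class ContentStyleEncoder(nn.Module):
    def __init__(self, content_dim=128, style_dim=8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_content = nn.Linear(64, content_dim)  # what is in the scene
        self.to_style = nn.Linear(64, style_dim)      # how the domain renders it

    def forward(self, x):
        h = self.backbone(x)
        return self.to_content(h), self.to_style(h)

enc = ContentStyleEncoder()
content, style = enc(torch.randn(1, 3, 256, 256))
# Interpolating `style` toward a clean-domain style code is one way to realize
# the continuous, user-adjustable enhancement levels mentioned in the abstract.
```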
We introduce the Few-Shot Object Learning (FewSOL) dataset for object recognition with a few images per object. We captured 336 real-world objects, with 9 RGB-D images per object from different views. Object segmentation masks, object poses, and object attributes are provided. In addition, synthetic images generated from 330 3D object models are used to augment the dataset. We investigated (i) few-shot object classification and (ii) joint object segmentation and few-shot classification using state-of-the-art methods for few-shot learning and meta-learning on our dataset. The evaluation results show that there is still a large margin for improvement on few-shot object classification in robotic environments. Our dataset can be used to study a range of few-shot object recognition problems such as classification, detection and segmentation, shape reconstruction, pose estimation, keypoint correspondence, and attribute recognition. The dataset and code are available at https://irvlutd.github.io/fewsol.
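An illustrative record layout for one sample of a dataset with the structure described above (9 RGB-D views per object with masks, poses, and attributes). The field names and shapes are assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class FewShotObjectSample:
    object_id: str
    rgb: np.ndarray          # (9, H, W, 3) uint8 color views of the object
    depth: np.ndarray        # (9, H, W) depth maps aligned with rgb
    mask: np.ndarray         # (9, H, W) binary segmentation masks
    pose: np.ndarray         # (9, 4, 4) object poses in each camera frame
    attributes: List[str]    # e.g. category, color, material labels
```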
Robots need to be able to learn concepts from their users in order to adapt their capabilities to each user's unique tasks. But this is impractical when the robot operates on high-dimensional inputs such as images or point clouds: the robot would require an impractical amount of human effort to learn each new concept. To address this challenge, we propose a new method in which the robot learns a low-dimensional variant of the concept and uses it to generate a larger dataset for learning the concept in the high-dimensional space. This lets the robot take advantage of semantically meaningful privileged information, such as object poses and bounding boxes, that is only accessible at training time, and allows richer human interaction to accelerate learning. We evaluate our method by learning prepositional concepts that describe object states or multi-object relationships, such as above, near, or aligned, which are key for users to specify task goals and execution constraints for the robot. Using a simulated human, we show that our method improves sample complexity compared to learning concepts directly in the high-dimensional space. We also demonstrate the utility of the learned concepts in motion planning tasks on a 7-DoF Franka Panda robot.
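A hedged sketch of the core idea, not the paper's implementation: fit a concept classifier on low-dimensional privileged features (here, relative object positions for an "above"-like relation), then use it to auto-label many sampled configurations so a high-dimensional model can later be trained on the enlarged dataset. The feature choice, toy labels, and logistic-regression learner are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def relation_features(pos_a, pos_b):
    # Privileged, low-dimensional description of a pairwise relation:
    # the offset between two object positions and its norm.
    delta = np.asarray(pos_a) - np.asarray(pos_b)
    return np.concatenate([delta, [np.linalg.norm(delta)]])

# A handful of human-labeled examples for an "above"-like concept.
demos = [(( 0.0, 0.0, 0.30), (0.0, 0.0, 0.0), 1),
         (( 0.0, 0.0, 0.25), (0.0, 0.0, 0.0), 1),
         (( 0.4, 0.0, 0.00), (0.0, 0.0, 0.0), 0),
         ((-0.3, 0.2, 0.05), (0.0, 0.0, 0.0), 0)]
X = np.array([relation_features(a, b) for a, b, _ in demos])
y = np.array([label for _, _, label in demos])
low_dim_concept = LogisticRegression().fit(X, y)

# Auto-label randomly sampled object configurations; the resulting
# (scene, label) pairs can then supervise an image or point-cloud model
# that needs no privileged poses at test time.
sampled = np.random.uniform(-0.5, 0.5, size=(1000, 2, 3))
labels = low_dim_concept.predict(
    np.array([relation_features(a, b) for a, b in sampled]))
```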
Deep learning models can achieve high accuracy when trained on large amounts of labeled data. However, real-world scenarios often involve several challenges: Training data may become available in installments, may originate from multiple different domains, and may not contain labels for training. Certain settings, for instance medical applications, often involve further restrictions that prohibit retention of previously seen data due to privacy regulations. In this work, to address such challenges, we study unsupervised segmentation in continual learning scenarios that involve domain shift. To that end, we introduce GarDA (Generative Appearance Replay for continual Domain Adaptation), a generative-replay based approach that can adapt a segmentation model sequentially to new domains with unlabeled data. In contrast to single-step unsupervised domain adaptation (UDA), continual adaptation to a sequence of domains enables leveraging and consolidation of information from multiple domains. Unlike previous approaches in incremental UDA, our method does not require access to previously seen data, making it applicable in many practical scenarios. We evaluate GarDA on two datasets with different organs and modalities, where it substantially outperforms existing techniques.
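A conceptual sketch of generative replay for continual adaptation, heavily simplified and not GarDA itself: a generator stands in for data from previous domains, the frozen previous segmenter supplies soft targets for consolidation, and an entropy term is used here as a stand-in unsupervised objective on the unlabeled current domain. The `latent_dim` attribute and the specific losses are assumptions.

```python
import torch
import torch.nn.functional as F

def replay_adaptation_step(segmenter, old_segmenter, generator, current_images, optimizer):
    # 1) Replay: synthesize "old-domain" images and distill the frozen old model,
    #    so previously learned domains are consolidated without stored data.
    z = torch.randn(current_images.size(0), generator.latent_dim,
                    device=current_images.device)
    replayed = generator(z).detach()
    with torch.no_grad():
        old_logits = old_segmenter(replayed)
    distill_loss = F.kl_div(F.log_softmax(segmenter(replayed), dim=1),
                            F.softmax(old_logits, dim=1), reduction="batchmean")

    # 2) Adaptation: unlabeled current-domain images, here via entropy
    #    minimization (one common unsupervised surrogate objective).
    probs = F.softmax(segmenter(current_images), dim=1)
    entropy_loss = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()

    loss = distill_loss + entropy_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```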
The development of social media user stance detection and bot detection methods relies heavily on large-scale and high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, suppressing graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built from the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain, along with user tweet features, as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when multiple relations are introduced. By analyzing the experimental results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
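A sketch of the feature-selection step described above, for illustration rather than MGTAB's released pipeline: rank user property features by information gain (estimated here with mutual information against the label) and keep the top 20. The stand-in data is random.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))       # stand-in user property features
y = rng.integers(0, 2, size=1000)     # stand-in bot / human labels

gain = mutual_info_classif(X, y, random_state=0)
top20 = np.argsort(gain)[::-1][:20]   # indices of the 20 most informative features
X_selected = X[:, top20]
```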
As one of the prevalent methods for achieving automation systems, Imitation Learning (IL) shows promising performance in a wide range of domains. However, despite the considerable improvement in policy performance, research on the explainability of IL models remains limited. Inspired by recent approaches in explainable artificial intelligence, we propose a model-agnostic explanation framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in the demonstrations. It iteratively retrains the black-box IL model on randomly masked demonstrations and uses the conventional evaluation outcome, the environment return, as the coefficient for building an importance map. We also conducted experiments to investigate three major questions concerning the equality of frames' importance, the effectiveness of the importance map, and the connections between importance maps from different IL models. The results show that R2RISE successfully distinguishes important frames in the demonstrations.
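A hedged sketch of the RISE-style importance map described above, not the authors' code: demonstration frames are randomly masked, the IL model is retrained on each masked set, and the resulting environment return weights the masks when accumulating per-frame importance. The sampling parameters and the toy evaluator are assumptions.

```python
import numpy as np

def r2rise_importance(num_frames, train_and_evaluate, num_samples=100, p_keep=0.5):
    """train_and_evaluate(mask) retrains the black-box IL policy on the
    demonstrations with mask[i] == 1 frames kept, and returns the policy's
    environment return. It is supplied by the user."""
    rng = np.random.default_rng(0)
    importance = np.zeros(num_frames)
    for _ in range(num_samples):
        mask = (rng.random(num_frames) < p_keep).astype(float)
        ret = train_and_evaluate(mask)
        importance += ret * mask          # returns act as the coefficients
    return importance / (num_samples * p_keep)

# Toy usage with a synthetic evaluator that "prefers" frames 10-19.
toy_eval = lambda m: float(m[10:20].sum())
imp = r2rise_importance(50, toy_eval, num_samples=200)
```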
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical for improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e., blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e., flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with low computational cost and higher consistency with human visual perception. For temporal artifacts, the self-attention based TimeSFormer is improved to detect them. Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.
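An illustrative sketch of saliency-aware artifact pooling of the kind the abstract describes, not the SSTAM formulation: per-PEA artifact strength maps are pooled with a visual-saliency map as weights, so artifacts in salient regions count more toward the score. Map shapes and value ranges are assumptions.

```python
import numpy as np

def saliency_weighted_scores(artifact_maps, saliency):
    # artifact_maps: dict of PEA name -> (H, W) artifact strength map in [0, 1]
    # saliency: (H, W) visual saliency map in [0, 1]
    w = saliency / (saliency.sum() + 1e-8)
    return {name: float((m * w).sum()) for name, m in artifact_maps.items()}

h, w = 64, 64
maps = {"blurring": np.random.rand(h, w), "blocking": np.random.rand(h, w)}
scores = saliency_weighted_scores(maps, np.random.rand(h, w))
```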
We propose a distributionally robust return-risk model for Markov decision processes (MDPs) under risk and reward ambiguity. The proposed model optimizes the weighted average of mean and percentile performances, and it covers the distributionally robust MDPs and the distributionally robust chance-constrained MDPs (both under reward ambiguity) as special cases. By considering that the unknown reward distribution lies in a Wasserstein ambiguity set, we derive a tractable reformulation of our model. In particular, we show that the return-risk model can also account for risk from an uncertain transition kernel when one only seeks deterministic policies, and that a distributionally robust MDP under the percentile criterion can be reformulated as its nominal counterpart at an adjusted risk level. A scalable first-order algorithm is designed to solve large-scale problems, and we demonstrate the advantages of our proposed model and algorithm through numerical experiments.
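To make the "weighted average of mean and percentile performances" concrete, here is one illustrative formulation in our own notation, which should not be read as the paper's exact model: the policy maximizes a convex combination of expected return and a percentile (VaR-type) return, worst-case over a Wasserstein ball around a nominal reward distribution.

```latex
% Illustrative return-risk objective under Wasserstein reward ambiguity
% (notation ours): R^{\pi} is the return of policy \pi, \lambda \in [0,1]
% trades off mean vs. percentile performance at level \alpha, and
% \mathcal{B}_{\varepsilon}(\hat{\mathbb{P}}) is a Wasserstein ball of
% radius \varepsilon around the nominal reward distribution \hat{\mathbb{P}}.
\max_{\pi \in \Pi} \;
\inf_{\mathbb{P} \in \mathcal{B}_{\varepsilon}(\hat{\mathbb{P}})}
\Big\{ \lambda \, \mathbb{E}_{\mathbb{P}}\!\big[ R^{\pi} \big]
     + (1-\lambda) \, \mathrm{VaR}^{\mathbb{P}}_{\alpha}\!\big( R^{\pi} \big) \Big\},
\qquad
\mathcal{B}_{\varepsilon}(\hat{\mathbb{P}})
  = \big\{ \mathbb{P} : W(\mathbb{P}, \hat{\mathbb{P}}) \le \varepsilon \big\}.
```

Setting \lambda = 0 recovers a purely percentile (chance-constrained-style) criterion, while \lambda = 1 recovers the distributionally robust expected-return MDP, matching the special cases noted above.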
Witnessing the impressive achievements of pre-training techniques on large-scale data in the fields of computer vision and natural language processing, we ask whether this idea can be adopted in a grab-and-go spirit to mitigate the sample inefficiency of visuomotor driving. Given the highly dynamic and variable nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive amounts of information irrelevant to decision making, making predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for policy pre-training in visuomotor driving. We aim to learn policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns a driving policy representation by predicting the future ego-motion and optimizing the photometric error based on the current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving-policy-related representations and is thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios demonstrate the superiority of our proposed approach, with improvements ranging from 2% to over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
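A minimal sketch of a photometric self-supervision loss of the kind the second stage relies on, with the exact loss form and weighting being assumptions rather than PPGeo's implementation: the current frame reconstructed from predicted depth and ego-motion is compared to the real frame with a combined SSIM and L1 penalty.

```python
import torch
import torch.nn.functional as F

def photometric_loss(reconstructed, target, alpha=0.85):
    # L1 term between the warped (reconstructed) frame and the real frame.
    l1 = (reconstructed - target).abs().mean(dim=1, keepdim=True)
    # Local-statistics SSIM term computed with 3x3 average pooling.
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x = F.avg_pool2d(reconstructed, 3, 1, 1)
    mu_y = F.avg_pool2d(target, 3, 1, 1)
    sigma_x = F.avg_pool2d(reconstructed ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(reconstructed * target, 3, 1, 1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    ssim = ((1 - ssim) / 2).clamp(0, 1).mean(dim=1, keepdim=True)
    return (alpha * ssim + (1 - alpha) * l1).mean()

loss = photometric_loss(torch.rand(1, 3, 128, 256), torch.rand(1, 3, 128, 256))
```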